
Fix replica unable to trigger migration when it received CLUSTER SETSLOT in advance #981

Merged
merged 11 commits into valkey-io:unstable from enjoy-binbin:better_replica_migration
Sep 13, 2024

Conversation

@enjoy-binbin (Member) commented Sep 2, 2024

When the primary's and replica's cluster-allow-replica-migration configuration items
differ, replica migration does not meet expectations.

Originally: the primary disallows migration and the replica allows it. In this case, the
code will migrate the replica away, and the primary will remain an empty primary. However,
in #970, we found a timing issue: if the replica receives CLUSTER SETSLOT first and only
then receives the gossip change, it will not trigger replica migration.

We perform replica migration explicitly in CLUSTER SETSLOT in this case.

Closes #970.
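
For illustration, a minimal, self-contained C sketch of the decision this change adds. All identifiers are hypothetical, not the actual cluster_legacy.c code; it only models the rule that a replica applying CLUSTER SETSLOT should consult its own cluster-allow-replica-migration once its primary is left with zero slots, instead of waiting for the gossip update:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the check added to the CLUSTER SETSLOT path.
 * All names here are illustrative, not Valkey internals. */
typedef struct {
    bool is_replica;              /* is this node a replica? */
    int primary_slot_count;       /* slots left on its primary after SETSLOT */
    bool allow_replica_migration; /* this node's own config value */
} node_view;

/* Before the fix, only the gossip path re-evaluated migration for a
 * replica; if CLUSTER SETSLOT arrived first, no equivalent check ran. */
static bool should_migrate_on_setslot(const node_view *me) {
    return me->is_replica && me->primary_slot_count == 0 && me->allow_replica_migration;
}

int main(void) {
    /* The replica processed CLUSTER SETSLOT first: its primary is now empty. */
    node_view replica = {.is_replica = true, .primary_slot_count = 0, .allow_replica_migration = true};
    printf("trigger replica migration: %s\n", should_migrate_on_setslot(&replica) ? "yes" : "no");
    return 0;
}
```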

@enjoy-binbin added the run-extra-tests label (Run extra tests on this PR; runs all tests from daily except valgrind and RESP) Sep 2, 2024
@enjoy-binbin (Member, Author) commented Sep 2, 2024

Logs (full logs at https://github.com/valkey-io/valkey/actions/runs/10658482333/job/29539601550):

replica
"""
 |####+'     .+#######+.     '+####|    Port: 27173
 |###|   .+###############+.   |###|    PID: 30240             
30240:M 02 Sep 2024 00:15:27.283 * No cluster configuration found, I'm 69c0773b7994df742f8a37862b06f32bebfeaee1


### Starting test valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT in tests/unit/cluster/replica-migration.tcl
30240:S 02 Sep 2024 00:15:39.789 - Accepted 127.0.0.1:37220
30240:S 02 Sep 2024 00:15:39.792 * Migrating slot 0 to node 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c ()
30240:S 02 Sep 2024 00:15:39.794 * Assigning slot 0 to node 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c () in shard fe55243c7e785a1e139292a70793f52b5d83da85
30240:S 02 Sep 2024 00:15:39.806 - Client closed connection id=8 addr=127.0.0.1:37220 laddr=127.0.0.1:27173 fd=28 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=37888 events=r cmd=cluster|nodes user=default redir=-1 resp=3 lib-name= lib-ver= tot-net-in=56 tot-net-out=1982 tot-cmds=2
30240:S 02 Sep 2024 00:16:31.690 - Client closed connection id=6 addr=127.0.0.1:27177 laddr=127.0.0.1:39568 fd=21 name= age=61 idle=2 flags=M db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=40954 argv-mem=0 multi-mem=0 rbs=1024 rbp=36 obl=0 oll=0 omem=0 tot-mem=43008 events=r cmd=ping user=(superuser) redir=-1 resp=3 lib-name= lib-ver= tot-net-in=374 tot-net-out=2283 tot-cmds=11
30240:S 02 Sep 2024 00:16:31.690 * Connection with primary lost.
30240:S 02 Sep 2024 00:16:31.690 * Caching the disconnected primary state.
30240:S 02 Sep 2024 00:16:31.690 * Reconnecting to PRIMARY 127.0.0.1:27177
30240:S 02 Sep 2024 00:16:31.690 * PRIMARY <-> REPLICA sync started
30240:S 02 Sep 2024 00:16:31.690 # Error condition on socket for SYNC: Connection refused
30240:S 02 Sep 2024 00:16:31.787 * Connecting to PRIMARY 127.0.0.1:27177
30240:S 02 Sep 2024 00:16:31.787 * PRIMARY <-> REPLICA sync started
30240:S 02 Sep 2024 00:16:31.787 # Error condition on socket for SYNC: Connection refused
30240:S 02 Sep 2024 00:16:31.989 * NODE 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c () possibly failing.
30240:S 02 Sep 2024 00:16:32.293 * NODE ee32166e30d115bb4c0ab65b192919235b722eec () possibly failing.
30240:S 02 Sep 2024 00:16:32.293 # Cluster state changed: fail
"""


primary
"""

    .+########+'     '+########+.                                    
 .########+'     .+.     '+########.    Running in cluster mode
 |####+'     .+#######+.     '+####|    Port: 27177
 |###|   .+###############+.   |###|    PID: 30370                     
30370:M 02 Sep 2024 00:15:28.690 * No cluster configuration found, I'm 53cd7df0480f96d7b0146bbffc3b599675d00bc0

### Starting test valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT in tests/unit/cluster/replica-migration.tcl
30370:M 02 Sep 2024 00:15:39.788 - Accepted 127.0.0.1:55280
30370:M 02 Sep 2024 00:15:39.792 * Migrating slot 0 to node 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c ()
30370:M 02 Sep 2024 00:15:39.794 * Slot 0 is migrated from node 53cd7df0480f96d7b0146bbffc3b599675d00bc0 () in shard 5342c5f1e5dacf2e24c0ab2a8300c51713b9f90c to node 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c () in shard fe55243c7e785a1e139292a70793f52b5d83da85.
30370:M 02 Sep 2024 00:15:39.794 * Slot 0 is no longer being migrated to node 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c () in shard fe55243c7e785a1e139292a70793f52b5d83da85.
30370:M 02 Sep 2024 00:15:39.794 * My last slot was migrated to node 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c () in shard fe55243c7e785a1e139292a70793f52b5d83da85. I am now an empty primary.
30370:M 02 Sep 2024 00:15:39.801 * Assigning slot 0 to node 6e5a9dab27e4c4b56b63ad890d41bed40e7c0b7c () in shard fe55243c7e785a1e139292a70793f52b5d83da85
"""


codecov bot commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.46%. Comparing base (fa348e2) to head (f9abd82).
Report is 8 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster_legacy.c | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #981      +/-   ##
============================================
- Coverage     70.57%   70.46%   -0.12%     
============================================
  Files           114      114              
  Lines         61660    61664       +4     
============================================
- Hits          43515    43449      -66     
- Misses        18145    18215      +70     
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/networking.c | 88.41% <100.00%> (+0.15%) | ⬆️ |
| src/replication.c | 87.08% <100.00%> (+0.12%) | ⬆️ |
| src/server.h | 100.00% <ø> (ø) | |
| src/cluster_legacy.c | 86.39% <85.71%> (+0.23%) | ⬆️ |

... and 9 files with indirect coverage changes

@PingXie (Member) commented Sep 3, 2024

@enjoy-binbin, your proposal looks intuitive and reasonable but I don't know if it matches the original intent of this flag.

@madolson you mentioned in today's meeting that, in a k8s deployment that lacks a control plane, the engine is expected to make node placement decisions. This makes sense on a high level but I still don't think I understand the purpose of this flag. This flag is effective only in a scale down scenario. So with or without this flag, we will end up with surplus k8s nodes in the end, which is likely the ultimate purpose of this scale down operation. The difference is that without this flag, these surplus pods will be hosting empty shards; while with this flag, we don't have empty shards but we will have more replicas. So if the goal is to scale down the k8s cluster, I would have thought that it is easier to identify empty shards/pods than trying to remove extra replicas? I must've missed some important context.

BTW, @enjoy-binbin, I believe some of the recent tests we worked on actually took a dependency on this split migration policy (so we could control the failover behavior/timing). I wonder if they will be broken by this PR.

In any case, this will be a breaking change.

@enjoy-binbin (Member, Author) commented

> your proposal looks intuitive and reasonable but I don't know if it matches the original intent of this flag.

Yes, indeed, I'm a bit fuzzy about this too.

> I believe some of the recent tests we worked on actually took a dependency on this split migration policy (so we could control the failover behavior/timing). I wonder if they will be broken by this PR.

The tests seem to be OK; they passed in CI.

I think in normal cases the configurations are consistent; in our tests we made them inconsistent, so here we are. There is also the opposite case: the primary allows migration and the replica does not. What happens if the primary loses its last slot? In our code (both unstable and this branch), both will become replicas. In that case, what does disallowing replica-migration mean on the replica side?

I think when redis/redis#5285 was introduced, a better way might have been to add an allow_primary_migration and an allow_replica_migration, as sketched below.
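
A purely hypothetical sketch of that split; neither option exists in valkey.conf today:

```
# Hypothetical config split (illustrative only, NOT existing options):
allow_primary_migration yes   # may an empty primary demote itself and follow its slots
allow_replica_migration yes   # may a replica migrate away to cover an orphaned primary
```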

If we are too worried, we can skip it. It's also good that the failing test exposed the problem so we can think about it later. (I think it has been there all along.)


@madolson (Member) commented Sep 3, 2024

> @madolson you mentioned in today's meeting that, in a k8s deployment that lacks a control plane, the engine is expected to make node placement decisions. This makes sense on a high level but I still don't think I understand the purpose of this flag. This flag is effective only in a scale down scenario. So with or without this flag, we will end up with surplus k8s nodes in the end, which is likely the ultimate purpose of this scale down operation. The difference is that without this flag, these surplus pods will be hosting empty shards; while with this flag, we don't have empty shards but we will have more replicas. So if the goal is to scale down the k8s cluster, I would have thought that it is easier to identify empty shards/pods than trying to remove extra replicas? I must've missed some important context.

I was talking about the replica barrier in conjunction with replica migration. The idea is that you have, say, 7 total pods, and then 3 primaries, each with a replica. You set the replica barrier to 1. If a node fails, that shard will be down to one node, so the extra replica will migrate over and fill in for the missing node while the other pod eventually comes back online. FWIW, I know very few people who do this.

However, some folks like to have strict shards, and they don't want replicas to ever migrate between nodes. This flag was originally introduced to handle the edge case where someone wanted to migrate data away from the shard and then delete the pods; they didn't want the replicas to follow.

So the tl;dr is that replica migration has larger implications, but this flag was introduced to solve a specific problem.
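
For context, a hedged valkey.conf fragment showing the two knobs discussed here; the values are illustrative, not a recommendation:

```
# Minimum number of replicas a primary must keep before one of its
# replicas may migrate to an orphaned primary.
cluster-migration-barrier 1
# Whether this node takes part in replica migration at all; the flag
# this PR also evaluates on replicas in the CLUSTER SETSLOT path.
cluster-allow-replica-migration yes
```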

> Better handling of replica-migration when primary and replica have different cluster-allow-replica-migration

My honest impression of this configuration is that I don't care if they're different. Like, you can do a lot of things that are inconsistent. I would rather warn users about inconsistent configs than try to handle it perfectly.

@enjoy-binbin (Member, Author) commented

Discussed this with @PingXie offline: we will remove it from 8.0 and consider it in the future. We will ignore this case in the failing test so that we can get a green CI.

… in advance

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin changed the title from "Better handling of replica-migration when primary and replica have different cluster-allow-replica-migration" to "Fix replica unable to trigger migration when it received CLUSTER SETSLOT in advance" Sep 11, 2024
@enjoy-binbin (Member, Author) commented

I backed up the old code to a different branch: https://github.com/enjoy-binbin/valkey/tree/better_replica_migration_bak

@PingXie I guess we can get this change into 8.0? I tested it locally and it works.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin added the release-notes label (This issue should get a line item in the release notes) Sep 11, 2024
…migration

Signed-off-by: Binbin <binloveplay1314@qq.com>
@PingXie (Member) left a comment


LGTM.

The hostname test seems to have failed quite consistently?

src/cluster_legacy.c (review thread resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin (Member, Author) commented

The hostname test got improved in #1016; I have rebased onto it and hope that solves it.

Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
src/cluster_legacy.c (outdated review thread, resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
src/cluster_legacy.c (outdated review thread, resolved)
src/cluster_legacy.c (outdated review thread, resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
src/cluster_legacy.c (outdated review thread, resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin (Member, Author) commented

@madolson @zuiderkwast do you want to take a look at this one too, in case I messed something up?

src/cluster_legacy.c (outdated review thread, resolved)
src/networking.c (outdated review thread, resolved)
src/replication.c (outdated review thread, resolved)
enjoy-binbin and others added 2 commits September 13, 2024 13:44
Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@PingXie (Member) commented Sep 13, 2024

Sorry, my empty shard tests are still failing. Let me disable them again to unblock others.

@PingXie PingXie merged commit dcc7678 into valkey-io:unstable Sep 13, 2024
56 checks passed
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 13, 2024
… in advance (valkey-io#981)

Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas

There is a timing bug where the primary and replica have different 
`cluster-allow-replica-migration` settings. In issue valkey-io#970, we found that if 
the replica receives `CLUSTER SETSLOT` before the gossip update, it remains 
in the original shard. This happens because we only process the 
`cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`.

This commit fixes the issue by also evaluating this flag for replicas in the 
`CLUSTER SETSLOT` path, ensuring correct replica migration behavior.

Closes valkey-io#970
---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
@PingXie (Member) commented Sep 13, 2024

We have seen more and more failures of #970 in both unstable and 8.0, so I went ahead and merged this PR. We can continue the discussion asynchronously.

@enjoy-binbin enjoy-binbin deleted the better_replica_migration branch September 14, 2024 00:18
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
… in advance (valkey-io#981)

PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
… in advance (valkey-io#981)

PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
… in advance (valkey-io#981)

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Sep 14, 2024
… in advance (valkey-io#981)

@enjoy-binbin enjoy-binbin mentioned this pull request Sep 14, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
… in advance (valkey-io#981)

PingXie added a commit that referenced this pull request Sep 15, 2024
… in advance (#981)

naglera pushed a commit to naglera/placeholderkv that referenced this pull request Oct 10, 2024
… in advance (valkey-io#981)

Labels
release-notes — This issue should get a line item in the release notes
run-extra-tests — Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Test failure] valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
3 participants